
    Neurostream: Scalable and Energy Efficient Deep Learning with Smart Memory Cubes

    High-performance computing systems are moving towards 2.5D and 3D memory hierarchies, based on High Bandwidth Memory (HBM) and Hybrid Memory Cube (HMC), to mitigate the main memory bottlenecks. This trend is also creating new opportunities to revisit near-memory computation. In this paper, we propose a flexible processor-in-memory (PIM) solution for scalable and energy-efficient execution of deep convolutional networks (ConvNets), one of the fastest-growing workloads for servers and high-end embedded systems. Our co-design approach consists of a network of Smart Memory Cubes (modular extensions to the standard HMC), each augmented with a many-core PIM platform called NeuroCluster. NeuroClusters have a modular design based on NeuroStream coprocessors (for convolution-intensive computations) and general-purpose RISC-V cores. In addition, a DRAM-friendly tiling mechanism and a scalable computation paradigm are presented to efficiently harness this computational capability with very low programming effort. NeuroCluster occupies only 8 percent of the total logic-base (LoB) die area in a standard HMC and achieves an average performance of 240 GFLOPS for complete execution of full-featured state-of-the-art (SoA) ConvNets within a power budget of 2.5 W. Overall, 11 W is consumed in a single SMC device, with an energy efficiency of 22.5 GFLOPS/W, which is 3.5× better than the best GPU implementations in similar technologies. The minor increase in system-level power and the negligible area increase make our PIM system a cost-effective and energy-efficient solution, easily scalable to 955 GFLOPS with a small network of just four SMCs.
    Authors: Azarkhish, Erfan; Rossi, Davide; Loi, Igor; Benini, Luca

    A Hybrid Instruction Prefetching Mechanism for Ultra Low-Power Multicore Clusters

    The instruction memory hierarchy plays a critical role in the performance and energy efficiency of ultra-low-power (ULP) processors for Internet-of-Things (IoT) end-nodes. This is mainly due to the extremely tight power envelope and area budgets, which imply small instruction caches (I-Caches) operating at very low supply voltages (near-threshold). The challenge is aggravated by the fact that multiple processors, fetching in parallel, require plenty of bandwidth from the I-Caches. In this letter, we propose a low-cost and energy-efficient hybrid instruction-prefetching mechanism to be integrated with a ULP multicore cluster. We study its performance for a wide range of IoT applications, from cryptography to computer vision, and show that it can effectively improve the hit rate of almost all of them to above 95% (an average performance improvement of over 2×). In addition, we designed our prefetcher and integrated it in a 4-core cluster in 28 nm fully-depleted silicon-on-insulator (FDSOI) technology. We show that the system's power consumption increases by only about 11% and silicon area by less than 1%. Altogether, a total energy reduction of 1.9× is achieved, thanks to the more than 2× performance improvement, enabling a significantly longer battery life.

    Scalable Hierarchical Instruction Cache for Ultra-Low-Power Processors Clusters

    High performance and energy efficiency are critical requirements for Internet of Things (IoT) end-nodes. Exploiting tightly-coupled clusters of programmable processors (CMPs) has recently emerged as a suitable solution to address this challenge. One of the main bottlenecks limiting the performance and energy efficiency of these systems is the instruction cache architecture due to its criticality in terms of timing (i.e., maximum operating frequency), bandwidth, and power. We propose a hierarchical instruction cache tailored to ultra-low-power tightly-coupled processor clusters where a relatively large cache (L1.5) is shared by L1 private caches through a two-cycle latency interconnect. To address the performance loss caused by the L1 capacity misses, we introduce a next-line prefetcher with cache probe filtering (CPF) from L1 to L1.5. We optimize the core instruction fetch (IF) stage by removing the critical core-to-L1 combinational path. We present a detailed comparison of instruction cache architectures' performance and energy efficiency for parallel ultra-low-power (ULP) clusters. Focusing on the implementation, our two-level instruction cache provides better scalability than existing shared caches, delivering up to 20% higher operating frequency. On average, the proposed two-level cache improves maximum performance by up to 17% compared to the state-of-the-art while delivering similar energy efficiency for most relevant applications. Comment: 14 page

    An IoT Endpoint System-on-Chip for Secure and Energy-Efficient Near-Sensor Analytics

    Near-sensor data analytics is a promising direction for IoT endpoints, as it minimizes energy spent on communication and reduces network load, but it also poses security concerns, as valuable data is stored or sent over the network at various stages of the analytics pipeline. Using encryption to protect sensitive data at the boundary of the on-chip analytics engine is a way to address data security issues. To cope with the combined workload of analytics and encryption in a tight power envelope, we propose Fulmine, a System-on-Chip based on a tightly-coupled multi-core cluster augmented with specialized blocks for compute-intensive data processing and encryption functions, supporting software programmability for regular computing tasks. The Fulmine SoC, fabricated in 65 nm technology, consumes less than 20 mW on average at 0.8 V, achieving an efficiency of up to 70 pJ/B in encryption, 50 pJ/px in convolution, or up to 25 MIPS/mW in software. As a strong argument for real-life flexible application of our platform, we show experimental results for three secure analytics use cases: secure autonomous aerial surveillance with a state-of-the-art deep CNN consuming 3.16 pJ per equivalent RISC op; local CNN-based face detection with secured remote recognition in 5.74 pJ/op; and seizure detection with encrypted data collection from EEG within 12.7 pJ/op. Comment: 15 pages, 12 figures, accepted for publication to the IEEE Transactions on Circuits and Systems - I: Regular Paper

    A -1.8V to 0.9V body bias, 60 GOPS/W 4-core cluster in low-power 28nm UTBB FD-SOI technology

    A 4-core cluster fabricated in low-power 28nm UTBB FD-SOI conventional-well technology is presented. The SoC architecture enables the processors to operate 'on-demand' over a wide supply voltage range of 0.44V (1.8MHz) to 1.2V (475MHz) and a wide body bias range of -1.2V to 0.9V, achieving a peak energy efficiency of 60 GOPS/W (419μW, 6.4MHz) at 0.5V with 0.5V forward body bias. The proposed SoC's energy efficiency is 1.4x to 3.7x greater than that of other low-power processors with comparable performance.

    Characterization and Implementation of Fault-Tolerant Vertical Links for 3-D Networks-on-Chip

    Through-silicon vias (TSVs) provide an efficient way to support vertical communication among different layers of a vertically stacked chip, enabling scalable 3-D networks-on-chip (NoC) architectures. Unfortunately, low TSV yields significantly impact the feasibility of high-bandwidth vertical connectivity. In this paper, we present a semi-automated design flow for 3-D NoCs including a defect-tolerance scheme to increase the global yield of 3-D stacked chips. Starting from an accurate physical and geometrical model of TSVs: 1) we extract a circuit-level model for vertical interconnections; 2) we use it to evaluate the design implications of extending switch architectures with ports in the vertical direction; moreover, 3) we present a defect-tolerance technique for TSV-based multi-bit links through an effective use of redundancy; and finally, 4) we present a design flow allowing for post-layout simulation of NoCs with links in all three physical dimensions. Experimental results show that a 3-D NoC implementation yields around 10% frequency improvement over a 2-D one, thanks to the propagation delay advantage of TSVs and the shorter links. In addition, the adopted fault tolerance scheme demonstrates a significant yield improvement, ranging from 66% to 98%, with a low area cost (20.9% on a vertical link in a NoC switch, which leads to a modest 2.1% increase in the total switch area) in 130 nm technology, with minimal impact on very large-scale integrated design and test flows.

    High Performance Ambipolar Field-Effect Transistor of Random Network Carbon Nanotubes

    Ambipolar field-effect transistors of random network carbon nanotubes are fabricated from an enriched dispersion utilizing a conjugated polymer as the selective purifying medium. The devices exhibit high mobility values for both holes and electrons (3 cm²/V·s) with a high on/off ratio (10⁶). The performance demonstrates the effectiveness of this process to purify semiconducting nanotubes and to remove the residual polymer.

    Overall Survival With Palbociclib And Fulvestrant in Women With HR+/HER2– ABC: Updated Exploratory Analyses of PALOMA-3, a Double-Blind, Phase 3 Randomized Study

    Purpose: To conduct an updated exploratory analysis of overall survival (OS) with a longer median follow-up of 73.3 months and evaluate the prognostic value of molecular analysis by circulating tumor DNA (ctDNA). Patients and methods: Patients with hormone receptor−positive/human epidermal growth factor receptor 2−negative (HR+/HER2−) advanced breast cancer (ABC) were randomized 2:1 to receive palbociclib (125 mg orally/d; 3/1 week schedule) and fulvestrant (500 mg intramuscularly) or placebo and fulvestrant. This OS analysis was performed when 75% of enrolled patients had died (393 events in 521 randomized patients). ctDNA analysis was performed among patients who provided consent. Results: At the data cutoff (August 17, 2020), 258 and 135 deaths had occurred in the palbociclib and placebo groups, respectively. The median OS (95% CI) was 34.8 months (28.8−39.9) in the palbociclib group and 28.0 months (23.5−33.8) in the placebo group (stratified hazard ratio 0.81; 95% CI, 0.65−0.99). The 6-year OS rate (95% CI) was 19.1% (14.9−23.7) and 12.9% (8.0−19.1) in the palbociclib and placebo groups, respectively. Favorable OS with palbociclib plus fulvestrant compared with placebo plus fulvestrant was observed in most subgroups, particularly in patients with endocrine-sensitive disease, no prior chemotherapy for ABC, and low circulating tumor fraction, and regardless of ESR1, PIK3CA, or TP53 mutation status. No new safety signals were identified. Conclusions: The clinically meaningful improvement in OS associated with palbociclib plus fulvestrant was maintained with >6 years of follow-up in patients with HR+/HER2− ABC, supporting palbociclib plus fulvestrant as a standard of care in these patients. Trial Registration: ClinicalTrials.gov Identifier: NCT0194213

    Overall Survival with Palbociclib and Fulvestrant in Women with HR+/HER2− ABC: Updated Exploratory Analyses of PALOMA-3, a Double-blind, Phase III Randomized Study

    Purpose: To conduct an updated exploratory analysis of overall survival (OS) with a longer median follow-up of 73.3 months and evaluate the prognostic value of molecular analysis by circulating tumor DNA (ctDNA). Patients and Methods: Patients with hormone receptor–positive/human epidermal growth factor receptor 2–negative (HR+/HER2−) advanced breast cancer (ABC) were randomized 2:1 to receive palbociclib (125 mg orally/day; 3/1 week schedule) and fulvestrant (500 mg intramuscularly) or placebo and fulvestrant. This OS analysis was performed when 75% of enrolled patients died (393 events in 521 randomized patients). ctDNA analysis was performed among patients who provided consent. Results: At the data cutoff (August 17, 2020), 258 and 135 deaths occurred in the palbociclib and placebo groups, respectively. The median OS [95% confidence interval (CI)] was 34.8 months (28.8–39.9) in the palbociclib group and 28.0 months (23.5–33.8) in the placebo group (stratified hazard ratio, 0.81; 95% CI, 0.65–0.99). The 6-year OS rate (95% CI) was 19.1% (14.9–23.7) and 12.9% (8.0–19.1) in the palbociclib and placebo groups, respectively. Favorable OS with palbociclib plus fulvestrant compared with placebo plus fulvestrant was observed in most subgroups, particularly in patients with endocrine-sensitive disease, no prior chemotherapy for ABC, and low circulating tumor fraction, and regardless of ESR1, PIK3CA, or TP53 mutation status. No new safety signals were identified. Conclusions: The clinically meaningful improvement in OS associated with palbociclib plus fulvestrant was maintained with >6 years of follow-up in patients with HR+/HER2− ABC, supporting palbociclib plus fulvestrant as a standard of care in these patients.